1 Introduction


Lunar rovers are used to perform various in-situ scientific experiments. This requires
a rover to understand the terrain accurately so that it can traverse to scientifically
interesting areas without running into obstacles, which in turn requires sensing the
position and height of the obstacles accurately. To locate the obstacles relative to
the rover, depth computation is essential. Depth can be estimated using active and
passive methods. The active methods include lidar, laser-based, depth camera, and
time-of-flight camera methods. The passive methods use stereo cameras. Active methods
consume much more power than passive methods, which only require two ordinary cameras
arranged in a stereo configuration. This makes passive methods more feasible and,
therefore, they have been used in this analysis.
Images captured by a single camera can only provide two-dimensional information.
Three-dimensional information is therefore reconstructed from the two-dimensional
images of a stereo camera. A point in world coordinates appears in the left and right
images of a rectified stereo pair with different horizontal coordinates but the same
vertical coordinate. The difference in the horizontal coordinate is called disparity.
Using this disparity, depth can be computed. There are four steps to compute disparity:
cost computation, cost aggregation, disparity computation using the minimum cost, and
disparity refinement. Disparity maps have been generated in the past using local,
semi-global, and global methods. Local methods allow for fast processing but lead to
inaccurate disparity maps. Global methods estimate disparities accurately but are
unsuitable for real-time applications due to their long processing time. In this work,
a good-quality disparity map is generated using Semi-Global Matching (SGM) for the
available dataset, which replicates the surface of the moon. The disparity maps
produced by the methods mentioned above have been compared and evaluated on the
Middlebury stereo vision dataset [1]. The paper is organized as follows: Sect. 2
contains the literature survey, Sect. 3 describes the methodology, Sect. 4 contains
the results and evaluation, and Sect. 5 contains the conclusions and future directions.


2 Literature Review 


In the work reported in [2], a dense stereo matching method for planetary rovers
has been proposed. The KITTI dataset [3] and stereo images of the Chang’e-3 rover
have been used for this analysis. The pixel-wise disparity search range is restricted
to a few pixels at coarse levels. This method is useful at depth discontinuities,
low-texture regions, and occlusion regions. The method proposed in [4] uses an
adaptive window to perform stereo matching on real stereo images. An initial estimate
of the disparity is produced statistically. The disparity is then calculated iteratively
until the algorithm converges. The method assumes that the intensities within a window
follow a zero-mean Gaussian white distribution when the center pixel is considered.
The work in [5] proposes an FPGA-based embedded system for stereo vision. The matching
cost is computed for each pixel in the reference and candidate images. Each reference
pixel is compared to each candidate pixel in terms of a dissimilarity metric computed
by aggregating the matching costs within an aggregation window of fixed radius. The
datasets used for this analysis were the Middlebury benchmarks Tsukuba, Venus, Teddy,
and Cones. The mentioned interfaces require additional costly hardware peripherals and
initialization procedures, resulting in complex systems with resource-consuming FPGA
designs involving external memories and CPUs. Local stereo matching methods are
area-based and have very fast processing speeds suitable for real-time applications [6];
however, their accuracy is low. Pixels in the reference image are matched by looking at
neighboring pixels in the corresponding match image within a window. In that analysis,
stereo algorithms such as SAD, SSD, NCC, SAD by derivatives, and the non-parametric
census transform are compared. Disparity maps are computed for both grayscale and color
images of the Middlebury Stereo Dataset, and accuracy is evaluated against the ground
truth disparities using RMS error and bad-pixel match as quality metrics. The method
proposed in [7] uses a belief propagation-based global method in real time to obtain
high-quality results. The algorithm contains two parts: the correlation volume
computation constructs the data term, and the belief propagation updates the smoothness
term to minimize the energy. The performance is attributed to parallelism in the
hardware, which enables a speedup of 45 times compared to the CPU implementation. The
runtime of a belief propagation algorithm based on adaptively updating pixel cost is
linear in the number of iterations. Due to the large number of iterations,
implementation in real-time applications is usually infeasible. Unlike general belief
propagation methods, this algorithm converges considerably faster. The results were
obtained for the Middlebury dataset. The work reported in [8] establishes a method for
stereo processing using SGM based on mutual information. Matching based on mutual
information is proposed, with 1D constraints combined to form a single 2D constraint.
The refinement procedures work comparatively better in the presence of occlusions and
outliers in the map. Deep learning methods have become very popular in recent years.
They have also been used in stereo matching to obtain results with high accuracy and
inference speed. FADNet [9] is a Fast and Accurate Disparity estimation Network that
makes use of the DispNetC architecture; Scene Flow and KITTI 2015 are used to evaluate
the performance of this network. Another method, Disp R-CNN, proposed by Sun et al. [10],
computes disparity only for specific pixels containing objects of interest instead of
the entire image, leading to a faster runtime. Wang et al. [11] proposed SMAR-Net, based
on generative adversarial learning, which consists of a disparity regressor and a left
image generator in a two-stage network trained by minimizing content loss and
adversarial loss. Deep learning methods require GPU utilization and hence prove to be
infeasible for real-time applications. The need for a large amount of ground truth data
for training these large networks poses a limitation, despite the generation of
pseudo-ground truth images in [10, 11]. Although local stereo matching methods have been
widely used for their fast processing speeds, the SGM method provides much-improved
accuracy at slightly higher processing times. Global methods require external memory and
a GPU due to their high computational load and hence are unsuitable for real-time
applications. SGM can be implemented efficiently on an FPGA. In the proposed work, SGM
has been used for computing the disparity map of lunar surface images in real time. Such
an attempt for images captured by a lunar rover has not been carried out so far.


3 Methodology 


Figure 1 illustrates the workflow in the form of a block diagram.


3.1 Camera Calibration 


Calibration of stereo cameras is crucial to perform image rectification. There are two
types of camera calibration parameters: intrinsic camera parameters and extrinsic
camera parameters. Intrinsic camera parameters describe the internal properties of the
camera. These parameters include focal length, optical center, aspect ratio,
distortion coefficients, and shear constant. Extrinsic camera parameters describe the
external parameters of the camera; these parameters are important because they give
us the exact baseline distance between the cameras. Fifteen pairs of left and right
chessboard images in various positions and tilts obtained from the Indian Space
Research Organization (ISRO) were used for calibration. The size of the checkerboard
pattern was 30 mm. The left and right checkerboard image pairs are illustrated
in Fig. 2.

Fig. 1 Block diagram of workflow

Fig. 2 Left and right checkerboard image pair


The calibration algorithm assumes a pinhole camera model:

$$w\,[x \;\; y \;\; 1] = [X \;\; Y \;\; Z \;\; 1]\,[R \;\; t]^{T} K \qquad (1)$$

where

(X, Y, Z): world coordinates of a point
(x, y): coordinates of the corresponding image point
w: arbitrary scale factor
K: camera intrinsic matrix
R: matrix representing the 3D rotation of the camera
t: translation of the camera relative to the world coordinate system
T: transpose of the matrix
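The pinhole model can be illustrated with a short sketch. It assumes the row-vector form w [x y 1] = [X Y Z 1] [R t]^T K, with the principal point stored in the last row of K. The function name `project_point` and all numeric values are hypothetical and are not taken from the ISRO calibration.

```python
import numpy as np

def project_point(world_xyz, K, R, t):
    """Project a 3D world point to pixel coordinates with a pinhole model.

    Assumed row-vector convention: w [x y 1] = [X Y Z 1] [R t]^T K.
    """
    homogeneous = np.append(np.asarray(world_xyz, dtype=float), 1.0)  # [X Y Z 1]
    # [R t] is 3x4; its transpose is the 4x3 extrinsic matrix.
    extrinsics = np.hstack([R, np.reshape(t, (3, 1))]).T
    wx, wy, w = homogeneous @ extrinsics @ K
    return wx / w, wy / w  # divide out the arbitrary scale factor w

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0, 0.0, 0.0],
              [0.0, 800.0, 0.0],
              [320.0, 240.0, 1.0]])
R = np.eye(3)    # camera axes aligned with the world axes
t = np.zeros(3)  # camera placed at the world origin

x, y = project_point([0.5, 0.25, 2.0], K, R, t)
print(x, y)
```

Calibration solves the inverse problem: given many known checkerboard corners and their detected image positions, it estimates K, R and t.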
Fig. 3 Epipolar geometry and epipolar geometry rectified


3.2 Image Rectification 


Image rectification is a process that projects multiple images onto the same image 
surface. It helps correct a distorted image into a standard coordinate system. It ensures 
that the images appear as though they have been taken only at a horizontal displace- 
ment. 

There are two ways of performing image rectification: without camera calibration
and with camera calibration. In this analysis, image rectification has been performed
using the camera calibration to avoid distortions. Image rectification makes use of
the camera parameters to derive transformation matrices that apply projective
transformations to the left and right images using the fundamental matrix. The first
transformation rotates the left and right images so that their axes are parallel to
the baseline axis. The second transformation aligns the image optical axes with the
baseline axis. Finally, both images are scaled to the same resolution. As a result,
the search for the corresponding pixel in the left and right images is reduced to a
search along the same image row.

In Fig. 3, O and O’ represent the centers of the left and right camera lenses and
P is the 3D point of interest. We focus on p and p’ which are the projections of 
the point P onto the camera planes. The darkened boxes represent the rectified left 
and right image planes after undergoing projective transformations. The resulting 
epipolar lines e and e’ are now parallel to the horizontal axis. 


3.3 Stereo Anaglyph


A stereo anaglyph is a special type of image that allows our eyes to perceive
three-dimensional information from two-dimensional images when viewed through special
glasses. It is obtained by superimposing the images taken from stereo cameras with
different fields of view, each rendered in a single color, usually red and cyan. Thus,
each eye sees a different image, which makes the brain automatically perceive depth.
Fig. 4 Stereo Anaglyph before rectification with a 100.06-pixel difference


Image rectification can be better visualized by observing the stereo anaglyphs of
the image pairs as shown in Figs. 4 and 5. The maximum disparity and minimum
disparity used for Figs. 4 and 5 are 304 and 16 respectively.
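A red-cyan anaglyph of the kind described above can be composed in a few lines. This is a minimal sketch; placing the left image in the red channel and the right image in the green and blue channels is a common convention assumed here, not one stated in the text, and `make_anaglyph` is a hypothetical name.

```python
import numpy as np

def make_anaglyph(left_gray, right_gray):
    """Compose a red-cyan anaglyph from two grayscale images of equal size.

    Assumed convention: left image -> red channel, right image -> green and
    blue channels (cyan). Output is an H x W x 3 array in RGB order.
    """
    left = np.asarray(left_gray, dtype=np.uint8)
    right = np.asarray(right_gray, dtype=np.uint8)
    if left.shape != right.shape:
        raise ValueError("left and right images must have the same shape")
    return np.dstack([left, right, right])

# Tiny synthetic images standing in for a rectified stereo pair.
left = np.full((2, 3), 200, dtype=np.uint8)
right = np.full((2, 3), 50, dtype=np.uint8)
anaglyph = make_anaglyph(left, right)
print(anaglyph.shape)
```

In a well-rectified pair the red and cyan copies of an object are offset only horizontally, which is why the anaglyph is a convenient visual check.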


3.4 Disparity Estimation 


Disparity refers to parallax or the difference in location of an object point due to the 
horizontal separation of the eyes. This can be realized by focusing on an object with 
one eye closed and immediately switching to the other eye. It can be seen that there 
is a slight motion in the location of the object while doing so. Stereo cameras are 
separated horizontally along the same axis while capturing the image of the same 
scene. The concept described above can be directly applied to stereo cameras as well. 
The disparity can be defined as the pixel difference between the corresponding points 
in the reference and match image. This difference can be made use of to derive the 
depth information of a scene or object whose image is captured by stereo cameras. 
When correspondence has been established for all points in an image, a disparity 
map can be obtained with each point representing the apparent pixel difference of 
corresponding points in the two images. 
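For a rectified stereo pair, depth follows from disparity through the standard relation Z = f B / d, where f is the focal length in pixels and B is the baseline. The sketch below illustrates this; the focal length and baseline values are hypothetical, and only the disparities 16 and 304 are taken from the text.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to depth (metres) using Z = f * B / d.

    Pixels with non-positive disparity are treated as invalid and given
    infinite depth.
    """
    disparity = np.asarray(disparity_px, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Hypothetical camera: f = 1000 px, baseline = 0.2 m.
depth = disparity_to_depth(np.array([[16.0, 304.0, 0.0]]),
                           focal_length_px=1000.0, baseline_m=0.2)
print(depth)
```

Because depth is inversely proportional to disparity, nearby objects produce large disparities and need a large search range.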


Local Methods of Disparity Estimation The sum of absolute differences (SAD)
is a metric used to calculate the degree of similarity between two images. It is
computed by taking the absolute difference between the intensity of each pixel in
one image and that of its corresponding pixel in the other. These differences are
then summed; a lower sum indicates a closer match. The sum of squared differences
(SSD) is a variation of SAD in which the intensity differences are squared before
being summed. Normalized Cross-Correlation (NCC) is another metric used to calculate
the degree of similarity between two images. It is a parametric method that depends
on the actual intensity values of the pixels in the image; however, its main advantage
is that it performs well in the presence of illumination changes. This is because
normalization is performed before the images are compared. Hence, the resulting
correlation value is confined to [-1, 1].
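The three local measures can be sketched for a pair of image patches as follows. The zero-mean form of NCC is assumed here, since it yields values in [-1, 1]; the patch values are made up for illustration.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences; lower means more similar."""
    return float(np.abs(np.asarray(a, float) - np.asarray(b, float)).sum())

def ssd(a, b):
    """Sum of squared differences; lower means more similar."""
    return float(((np.asarray(a, float) - np.asarray(b, float)) ** 2).sum())

def ncc(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1]; higher means more similar."""
    a0 = np.asarray(a, float) - np.mean(a)
    b0 = np.asarray(b, float) - np.mean(b)
    denom = np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum())
    if denom == 0:  # flat patch: correlation is undefined, return 0 by choice
        return 0.0
    return float((a0 * b0).sum() / denom)

patch = np.array([[10, 20], [30, 40]])
brighter = 2 * patch + 5  # same structure under a gain and offset change

print(sad(patch, brighter), ssd(patch, brighter), ncc(patch, brighter))
```

The example shows why NCC tolerates illumination changes: SAD and SSD report a large difference for the brightened patch, while NCC still reports a perfect match.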


Semi-Global Matching (SGM) SGM is a stereo matching algorithm that lies in 
between local and global methods used to obtain the disparity map for a rectified 
stereo image pair. Pixel-wise matching can be performed using many methods, from 
simple intensity-based matching to matching using Mutual Information. All these 
methods try to find the correspondence for every pixel based on a similarity measure. 
The matching cost is calculated for a base image pixel p and its corresponding match 
pixel q in the match image, which is expected to lie on the epipolar line for rectified
images. The method used by the authors for pixel-based matching cost computation 
is the Census Transform. 

Census Transform is a non-parametric image transformation that associates with 
each pixel of a grayscale image, a binary string based on a specific criterion. The 
transformation does not depend directly on the intensity values associated with each 
pixel but on the relative ordering of its intensities under a fixed-size window. Census 
Transform is given by [8]. 


$$C(p, p') = \begin{cases} 0 & \text{if } p > p' \\ 1 & \text{if } p < p' \end{cases} \qquad (2)$$


Hamming distance is used as a distance metric (an inverse similarity metric) between
any two bit strings resulting from the Census Transform. In a practical implementation,
the Census Transform outputs of the left and right rectified images are XOR'd
pixel-wise and the set bits are counted to generate the matching cost. The objective
of cost aggregation is to minimize the energy term. Due to noise, the cost of a wrong
match could be lower than the cost of the correct match. Hence, additional constraints
(the second and third terms of Eq. 3) are added to enforce smoothness [8].
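The Census Transform and the Hamming-distance cost can be sketched as below. Following Eq. 2, a bit is set when the centre pixel is smaller than its neighbour; mapping ties to 0, the 3 x 3 window, and the edge padding are assumptions of this sketch rather than details given in the text.

```python
import numpy as np

def census_transform(image, radius=1):
    """Pack one comparison bit per neighbour into an integer for every pixel.

    Bit is 1 when centre < neighbour and 0 otherwise (ties -> 0, an assumption).
    """
    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    census = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # skip the centre pixel itself
            neighbour = padded[radius + dy:radius + dy + h,
                               radius + dx:radius + dx + w]
            bit = (img < neighbour).astype(np.uint64)
            census = (census << np.uint64(1)) | bit
    return census

def hamming_cost(census_a, census_b):
    """Count differing bits: XOR the bit strings, then count the set bits."""
    xor = np.bitwise_xor(census_a, census_b)
    cost = np.zeros(xor.shape, dtype=np.int64)
    while np.any(xor):
        cost += (xor & np.uint64(1)).astype(np.int64)
        xor = xor >> np.uint64(1)
    return cost

image_a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
image_b = image_a.copy()
image_b[1, 1] = 10  # centre is now larger than all of its neighbours

census_a = census_transform(image_a)
census_b = census_transform(image_b)
cost = hamming_cost(census_a, census_b)
print(int(census_a[1, 1]), int(cost[1, 1]))
```

In a stereo matcher the cost for disparity d would compare the census code at (x, y) in the left image with the code at (x - d, y) in the right image.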


$$E(D) = \sum_{p} \Big( C(p, D_p) + \sum_{q \in N_p} P_1\, T\big[\,|D_p - D_q| = 1\,\big] + \sum_{q \in N_p} P_2\, T\big[\,|D_p - D_q| > 1\,\big] \Big) \qquad (3)$$


The first term is the sum of all pixel matching costs for the disparities of D. The
second and third terms are penalty terms. The second term corresponds to the penalty
applied when the disparity of a neighboring pixel changes by 1 pixel, and the third
term corresponds to the penalty applied when the difference in disparity is greater
than 1. Using a lower penalty for disparity changes of 1 pixel permits adaptation to
slanted or curved surfaces. A higher penalty for larger changes in disparity preserves
discontinuities. Therefore, P2 > P1.
The energy function can be minimized efficiently along individual image rows using
Dynamic Programming. However, this would lead to very strong constraints in one
direction combined with weaker constraints in the other directions. To solve this
issue, costs are aggregated along 1D paths from all 8 directions equally (NE, N,
NW, W, SW, S, SE, E). The cost along a path in direction r is defined by a recursive
function computed using Eq. 4 [8].


$$L_r(p, d) = C(p, d) + \min\big( L_r(p - r, d),\; L_r(p - r, d - 1) + P_1,\; L_r(p - r, d + 1) + P_1,\; \min_i L_r(p - r, i) + P_2 \big) - \min_k L_r(p - r, k) \qquad (4)$$
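The recursion of Eq. 4 can be sketched for a single path direction. The function below aggregates from left to right only, on a cost volume of shape H x W x D; the function name, the penalty values and the tiny cost volume are illustrative assumptions.

```python
import numpy as np

def aggregate_left_to_right(cost, p1, p2):
    """Path cost L_r(p, d) for the direction that moves from left to right.

    cost: H x W x D matching-cost volume C(p, d).
    p1, p2: penalties for disparity changes of 1 and of more than 1 (p2 > p1).
    """
    cost = np.asarray(cost, dtype=float)
    h, w, d = cost.shape
    agg = np.empty_like(cost)
    agg[:, 0, :] = cost[:, 0, :]  # no predecessor on the image border
    for x in range(1, w):
        prev = agg[:, x - 1, :]                      # L_r(p - r, .)
        prev_min = prev.min(axis=1, keepdims=True)   # min_k L_r(p - r, k)
        from_minus = np.full_like(prev, np.inf)
        from_minus[:, 1:] = prev[:, :-1] + p1        # L_r(p - r, d - 1) + P1
        from_plus = np.full_like(prev, np.inf)
        from_plus[:, :-1] = prev[:, 1:] + p1         # L_r(p - r, d + 1) + P1
        best = np.minimum(np.minimum(prev, from_minus),
                          np.minimum(from_plus, prev_min + p2))
        agg[:, x, :] = cost[:, x, :] + best - prev_min
    return agg

# One image row, two pixels, three candidate disparities.
cost = np.array([[[5.0, 1.0, 4.0],
                  [2.0, 3.0, 0.0]]])
agg = aggregate_left_to_right(cost, p1=1.0, p2=4.0)
print(agg[0, 1])
```

A full implementation would run the same recursion along each of the 8 directions, sum the resulting path costs per pixel and disparity, and pick the disparity with the minimum summed cost.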


4 Results 


The major problem faced while capturing images of the lunar surface is that of illu- 
mination. The low lighting conditions make the process of estimating depth difficult. 
Navcam makes use of visible light while capturing the stereo image pair. Similar
low-illumination conditions have been created by the authors in the laboratory and
used to ensure the generation of good-quality disparity maps.

Occlusion problems in stereo vision restrict the generation of good disparity maps. 
The occlusion problem is faced when regions behind objects are visible only in one 
field of view and are hidden in the other. This leads to uncertainty in the value of 
disparity for occluded pixels resulting in holes. To overcome this, the occluded pixels
must first be detected by performing a left-right consistency check. The
proposed method has been tested and verified on the experimental data.


4.1 Quantitative Analysis on the Standard Dataset 


The algorithms have been compared using the root mean squared error between the
disparity maps obtained with the NCC and SGM algorithms and the available true
disparity values.

Root mean square error is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - y_i)^2} \qquad (5)$$

RMSE = root mean squared error
n = number of data points
Y_i = observed values
y_i = predicted values
Fig. 5 Stereo Anaglyph after rectification
The results have been calculated only for those pixels in the true disparity map that
do not have a 0 value. The maximum disparity range used to obtain the disparity
map for SGM is 64. The ground truth and the computed disparity image are normalized
using the maximum disparity value. The first 64 pixel columns in the left image have
no corresponding match, leading to holes in the obtained disparity map. The first 64
pixel columns in the ground truth image are therefore neglected while performing the
root mean squared error comparison.
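The evaluation rule described above (ignore ground-truth pixels equal to 0 and skip the leading columns) can be sketched as follows. The normalization step is omitted, the function name is hypothetical, and the small arrays are made-up values used with one skipped column instead of 64.

```python
import numpy as np

def masked_rmse(predicted, ground_truth, skip_columns=64):
    """RMSE over pixels with non-zero ground truth, ignoring the first columns."""
    pred = np.asarray(predicted, dtype=float)[:, skip_columns:]
    truth = np.asarray(ground_truth, dtype=float)[:, skip_columns:]
    valid = truth != 0  # a 0 in the ground truth marks an unknown disparity
    if not valid.any():
        raise ValueError("no valid ground-truth pixels to evaluate")
    diff = truth[valid] - pred[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([[9.0, 0.0, 10.0, 20.0]])
pred = np.array([[0.0, 5.0, 13.0, 16.0]])
rmse = masked_rmse(pred, truth, skip_columns=1)
print(rmse)
```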

The disparity maps obtained on the Cones dataset can be seen in Fig. 6. The results
obtained are shown in Table 1. The proposed method works best in cases of occlusion,
which is the aim of this paper. The standard Middlebury datasets show few or no
occluded pixels in their images; however, a significant decrease in the root mean
square error can be seen in the case of the Teddy images using the proposed method.


4.2 Qualitative Analysis 


Topographic conditions similar to those on the lunar surface have been mimicked 
in the laboratory while capturing stereo image pairs for experimentation. Disparity 
maps for the ISRO dataset obtained using the SGM algorithm can be seen in Figs. 7, 8 and 9.


Table 1 Root mean square error of the disparity map generated on the Middlebury dataset in pixels (px)

Technique                         Cones (px)   Teddy (px)
NCC [6]                           6.727        7.053
Segment tree [12]                 4.663        4.164
Non-local [12]                    4.385        3.878
Guided filter [12]                2.919        3.662
SGM [8]                           2.512        2.758
Modified-SGM (proposed method)    2.761        2.29
Fig. 6 Disparity maps obtained on Cones dataset (panels: true disparity map for Cones; disparity map using NCC; disparity map using SGM; color-coded disparity map using SGM; disparity map after performing LRC; disparity map obtained using modified-SGM)


Fig. 7 Set 1 (panels: left rectified image for set 1; disparity map using SGM; color-coded disparity map using SGM)


4.3 Occlusion Handling for a Wide-Baseline Stereo Camera (Modified-SGM)


Fig. 8 Set 2 (panels: left rectified image for set 2; disparity map using SGM; color-coded disparity map using SGM)

Fig. 9 Set 3 (panels: left rectified image for set 3; disparity map using SGM; color-coded disparity map using SGM)

The process involves discarding pixels in both the left and right disparity maps
which do not hold the same disparity values due to wrong matches from occlusion
and mismatch. Hole filling can then be performed by finding, for each hole, the first
pixel on the left and the first pixel on the right with a valid disparity value, and
then replacing the discarded pixel with their minimum, min(left, right). The resulting
map has been refined further using Weighted Median Filtering. Figure 10 shows the
disparity maps for occlusion handling cases using the method stated above.
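The consistency check and hole filling described in this section can be sketched as follows. The sketch checks only the left disparity map, assumes that a left pixel at column x with disparity d matches the right pixel at column x - d, and uses a hypothetical `INVALID` marker of -1; the weighted median filtering step is not included.

```python
import numpy as np

INVALID = -1  # hypothetical marker for discarded pixels

def left_right_check(disp_left, disp_right, tolerance=0):
    """Discard left-map pixels whose matched right-map pixel disagrees."""
    disp_left = np.asarray(disp_left, dtype=np.int64)
    disp_right = np.asarray(disp_right, dtype=np.int64)
    h, w = disp_left.shape
    checked = disp_left.copy()
    for y in range(h):
        for x in range(w):
            d = disp_left[y, x]
            xr = x - d  # assumed matching column in the right image
            if d < 0 or xr < 0 or abs(disp_right[y, xr] - d) > tolerance:
                checked[y, x] = INVALID
    return checked

def fill_holes(disparity):
    """Replace each hole with min(first valid on the left, first valid on the right)."""
    filled = disparity.copy()
    for y in range(disparity.shape[0]):
        row = disparity[y]
        for x in np.flatnonzero(row == INVALID):
            left_valid = row[:x][row[:x] != INVALID]
            right_valid = row[x + 1:][row[x + 1:] != INVALID]
            candidates = []
            if left_valid.size:
                candidates.append(left_valid[-1])
            if right_valid.size:
                candidates.append(right_valid[0])
            if candidates:  # leave the hole if the whole row is invalid
                filled[y, x] = min(candidates)
    return filled

disp_left = np.array([[1, 1, 1, 2, 3, 3]])
disp_right = np.array([[1, 1, 3, 0, 0, 0]])
checked = left_right_check(disp_left, disp_right)
filled = fill_holes(checked)
print(checked, filled)
```

Taking the minimum of the two neighbours fills a hole with the smaller disparity, that is, the farther surface, which is the usual choice for occluded background regions.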

It has been observed that the proposed procedure computes the disparity with low
error while handling occlusion using an ordinary stereo camera setup. It can be used
on board to provide disparity to the lunar rover in real time, so that the rover can
position itself and navigate along an appropriate path with a significant reduction
in computation and cost.


5 Conclusion and Future Directions 


The quantitative analysis performed on the disparity maps indicates that the
SGM method generates the disparity map with the least error. Therefore, SGM has
been used to generate the disparity map from the dataset obtained from ISRO, with an
additional down-sampling step. Only qualitative analysis has been done on this
dataset, as obtaining ground truth for real scene images was not feasible. Good-quality
disparity maps have been produced, as can be seen in Sect. 4. The maximum and the
minimum disparity considered are 304 and 16 respectively. A high maximum disparity is
necessary to obtain accurate results for the objects of interest, namely objects
closer to the camera plane. The left-right consistency check has been performed to get
rid of mismatched and occluded pixels, followed by hole filling. The current method of
generating disparity maps involves offline computation as the rover has low
computational power. The proposed method is suitable for real-time computation and
enables FPGA implementation for on-board disparity map generation.

Fig. 10 Results obtained

The work could be enhanced by making use of images of the moon captured
in real time. Global methods could not be implemented as GPU utilization was a
constraint. A larger disparity search range could have been considered for more
accurate results, but the search time would increase proportionally. A larger search
space could be considered for an application that is not required to run in real
time. Stereo cameras alone cannot recover depth in occluded regions. In the future, a
combination of stereo cameras and lidar-based sensors can help to overcome this problem.